ABSTRACT 


A research experiment was conducted to determine whether various 
combinations of training methodologies and speaking voices would affect 
recognition accuracies amongst unique speaker dependent speech recognition 
(SR) systems. The experiment used a SR system (VOTAN VTR 6050II) which is 
based on VOTAN (proprietary) technology. Ten subjects trained five different 
voice patterns each and conducted four natural voice tests to compile statistics 
about the recognition accuracy for each pattern. Two patterns (natural voice and 
declarative voice) were retested using a declarative voice. 

The experiment was successful and demonstrated that different 
combinations of training methodologies and speaking voices can significantly 
affect the performance of unique discrete dependent SR systems. This thesis 
discusses the research methodology, reviews and analyzes the data collected, and 
states conclusions drawn about the particular dependent SR system used in the 
experiment. 

I. INTRODUCTION 


A research experiment was conducted to determine whether various 
combinations of training methodologies and speaking voices would affect 
recognition accuracies amongst unique speaker dependent SR systems. The 
experiment used a SR system (VOTAN VTR 6050II) which is based on VOTAN 
(proprietary) technology. Ten subjects trained five different voice patterns each 
and conducted four natural voice tests to compile statistics about the recognition 
accuracy for each pattern. Two patterns (natural voice and declarative voice) 
were retested using a declarative voice. Statistics were compiled on the 
interaction of these independent variables. This thesis discusses the research 
methodology, reviews and analyzes the data collected, and states conclusions 
drawn about the particular dependent SR system used in the experiment. 


A. BACKGROUND 

This experiment was conducted as follow-on research based on a thesis 
completed in March 1991 by CDR Richard L. Miller. Each SR system's 
performance is dependent on whether its algorithms can accurately capture an 
individual's speech characteristics and later match them to spoken words. The 
Miller thesis sought to determine whether a dependent SR system's word 
recognition accuracy would vary significantly with the training method used. 
Miller's research found a definite relationship between training method and 
recognition accuracy (Miller, 1991). 

A common mistake when using SR equipment is talking too meekly to the 
system. The system can't recognize what it can't hear (Poock, 1990). Failure to 
speak loudly enough causes problems not only during system operation but 
especially during template training. Declarative speech normally eliminates this 
problem by naturally causing the speaker to raise his voice. The original research 
was duplicated with the addition of two new voice patterns. Five types of voice 
patterns were tested using a natural voice input. In addition, the two patterns 
which performed best in terms of recognition accuracy were retested using a 
declarative voice input. 


B. PROBLEM 

Do optimal training methods exist and if so do they differ amongst unique 
discrete/dependent SR systems? Each dependent SR system is individualistic as 
defined by the type of algorithms it uses to produce voice templates. An optimal 
training method for one system may not be the best for other systems. Is it 
possible to quickly determine an optimal training method for each SR system? 
Natural voice training is an intuitive method to start with, but is it optimal or at 
least "good enough" when compared to other training methods? 

If training methods affect recognition accuracy, a logical follow-on question 
would be: Can how an individual “speaks” to the computer affect a system's 
performance? Vendors generally recommend training their SR systems in a 
natural voice but don't discuss how to speak to the computer during operational 
use. This thesis addresses these questions as they apply to one specific 
discrete/dependent SR system. 


C. SCOPE OF THE THESIS 
The objective of the thesis is to determine whether there is any statistically 
significant difference in performance between five different training 
methodologies while using two speech types to test a specific, dependent SR 
system. Training methodologies that are the same as those tested during the 
Miller research will be compared to determine if a common optimal training 
method exists. 


D. LIMITATIONS 
Time limitations precluded conducting the experiment on more than one type 
of dependent SR system. The results herein are system specific and cannot be 
generalized for all dependent SR systems. 


II. EXPERIMENT PROCEDURE 


A. SUBJECTS 

Ten subjects (two female, eight male) participated in this study. One of the 
female subjects was a civilian. The remaining subjects were military officers who 
were enrolled at the Naval Postgraduate School in Monterey, California. Some 
subjects had educational knowledge of SR systems, but none had actual 
experience using a SR system before this experiment. 


B. SR SYSTEM 

The SR system chosen was a stand-alone, off-the-shelf product called 
VOTAN VTR 6050II, which is based on VOTAN SR technology. The algorithm 
used in the VTR 6050II speech drivers is proprietary. The SR system allows 
manipulation of two parameters: input gain and acceptance level. The 
acceptance level can be set on a scale of 0-255 and allows comparison of the 
spoken utterance with a given template to determine if the accuracy of match is 
equal to or exceeds the chosen level. A level of zero would require a perfect 
match while a level of 255 would result in any utterance being recognized. The 
level was set at the vendor's recommendation of 50 for this experiment (e.g., if the 
SR system's algorithm determined a value of 50 or less for an utterance match, it 
would display the word). The input gain allows the user to decrease input gain 
when using the system in a noisy environment. The gain could be adjusted in a 
range of values 1-5. The noisier the environment, the lower the input gain should 
be. Input gain was set at a value of 2 even though the experiment was 
conducted in a soundproof booth. The system displayed warning messages if 
the input gain was too high or low. 

A noise-cancelling "boom" microphone mounted on a headset was used for 
voice input to the system. 
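
The acceptance-level rule described in this section can be illustrated with a 
minimal sketch. The VOTAN matching algorithm is proprietary, so the function 
name, the idea of a single integer match score per utterance, and the convention 
that a lower score means a closer match are assumptions made only for 
illustration. 

```python
ACCEPTANCE_MIN = 0      # a level of 0 would require a perfect match
ACCEPTANCE_MAX = 255    # a level of 255 would accept any utterance
VENDOR_LEVEL = 50       # level used in this experiment


def is_recognized(match_score: int, acceptance_level: int = VENDOR_LEVEL) -> bool:
    """Return True when a hypothetical match score passes the acceptance level.

    Assumes lower scores mean a closer match to the template, so an
    utterance is accepted when its score is at or below the chosen level.
    """
    if not ACCEPTANCE_MIN <= acceptance_level <= ACCEPTANCE_MAX:
        raise ValueError("acceptance level must be between 0 and 255")
    return match_score <= acceptance_level
```

Under these assumptions, a score of 50 would be accepted at the experiment's 
setting while a score of 51 would be rejected. 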


C. EXPERIMENT DESIGN 

Each subject was given instructions on how to train the SR system. A dumb 
computer monitor displayed the word being trained and warning messages if the 
input gain was too low/high. The VOTAN VTR 6050II voice card has limited 
memory capacity and can accept up to 50 words at a time if three training passes 
are made to create each template. The vendor recommended a set of no more 
than 20 words in order to enhance recognition and response time. The same 
vocabulary list of 90 words (Appendix A) used in the Miller study was used to 
create each template. Due to the memory limitations of the voice card, this list 
was broken into three separate 30-word lists. Each subject conducted three 
training passes per template to create five voice templates of each word: Pattern 
#1--‘natural’; Pattern #2--‘artificial inflection’; Pattern #3--‘rapid-speak’; 
Pattern #4--‘interrogative’; and Pattern #5--‘declarative’ (see the Testing section 
which follows). 
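
The division of the 90-word vocabulary into three 30-word lists can be sketched 
as follows. The placeholder words and the function name are hypothetical; the 
actual vocabulary is the Appendix A list, which is not reproduced here. 

```python
def split_vocabulary(words, list_size=30):
    """Break a vocabulary into consecutive sub-lists of at most list_size words."""
    return [words[i:i + list_size] for i in range(0, len(words), list_size)]


# Placeholder vocabulary standing in for the 90 words of Appendix A.
vocabulary = [f"word{i:02d}" for i in range(1, 91)]
sub_lists = split_vocabulary(vocabulary)

print(len(sub_lists), [len(s) for s in sub_lists])
```

Each sub-list stays below the 50-word capacity stated for the voice card when 
three training passes are made per template. 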

Each subject conducted, on four separate occasions, a series of test runs 
against their templates using a natural voice. One test run against each template 
was conducted during each trial session (total of five test runs for each trial: 4 
trials x 5 templates = 20 test runs for each subject; total of 20 x 10 subjects = 200 
test runs). Each template was loaded into the SR system in random order and the 
subjects were instructed to say each word on the vocabulary list one time. The 
order of the vocabulary words was modified for each trial to create as much 
randomness as possible. The subjects were not allowed to view the computer 
monitor during trial runs and were not aware of which voice template they were 
speaking against. 

Pattern #1 and Pattern #5 were retested using the same format but with both 
Voice #1--‘natural’ and Voice #2--‘declarative’ speech inputs (total of two test 
runs for each trial: 4 trials x 2 voice inputs x 2 templates = 16 test runs for each 
subject; total of 16 x 10 subjects = 160 test runs). 
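
The run counts quoted in this section follow directly from the design factors. 
The short sketch below reproduces that arithmetic as stated; the variable names 
and the "phase 1" and "phase 2" labels are introduced here only for illustration. 

```python
SUBJECTS = 10
TRIALS = 4
TEMPLATES = 5            # Patterns #1 through #5
RETEST_TEMPLATES = 2     # Patterns #1 and #5
RETEST_VOICES = 2        # Voice #1 (natural) and Voice #2 (declarative)

# Phase 1: every template tested once per trial with a natural voice.
phase1_runs_per_subject = TRIALS * TEMPLATES
phase1_total_runs = phase1_runs_per_subject * SUBJECTS

# Phase 2: two templates crossed with two voice inputs, as in the stated formula.
phase2_runs_per_subject = TRIALS * RETEST_VOICES * RETEST_TEMPLATES
phase2_total_runs = phase2_runs_per_subject * SUBJECTS

print(phase1_runs_per_subject, phase1_total_runs)  # 20 200
print(phase2_runs_per_subject, phase2_total_runs)  # 16 160
```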


D. PROCEDURE 
1. Training 

Acoustic energy which is produced during speech is affected by changes 
in loudness, pitch, rate of speech, stress and vocal quality (Tiffany, Carrell, 1977). 
Each of the five types of templates attempts to take advantage of one or more of 
these speech qualities. A SR system is dependent on distinctive changes in voice 
characteristics to produce reliable matching of templates to speech inputs. 
Templates are more reliable if distinctive vocal features can be incorporated to 
produce them (Dixon, Martin, 1979). The training templates consisted of 90 
vocabulary words, repeated three times by each subject (90 x 3 x 10 subjects = 
2,700 utterances). Each subject created their own, unique templates. Pattern #1, 
#2 and #3 templates were created in the same manner as they were for the Miller 
study. Pattern #4 (interrogative) had each subject speak each word as if asking a 
question. This produced an exaggerated upward or downward inflection on 
each of the three repetitions. An interrogative type statement will naturally 
produce either an upward or downward inflection at the end of a word (Tiffany, 
Carrell, 1977). Pattern #5’s templates (declarative) were created in the same 
manner, each subject speaking the words as if giving the computer a command. A 
command type utterance seems to involve an enhancement of all of the speech 
qualities mentioned above. 

During training, the VOTAN system allowed the researcher to accept or 
reject each utterance by a subject. Acceptance was purely subjective except in 
the case of input gain being too low/high. The system provided no feedback as 
to the similarity of utterances. After accepting three repetitions of the utterance, 
the voice template was saved to computer memory disk. These templates were 
later input into the system's speech analyzer to test for recognition accuracy. The 
training procedure took approximately 90 minutes for each subject to train all five 
voice patterns. 

2. Testing 

Testing began approximately one week after all subjects had completed 
creating their templates. Each of the 10 subjects initially conducted four trials 
each using a natural speaking voice. A trial consisted of five test runs (one for 
each template). The natural and declarative voice templates were retested using a 
declarative speaking voice. Testing was made as random as possible. Templates 
were loaded into the SR system in a random order and each subject read through 
a corresponding list of vocabulary words. Six lists of vocabulary words were 
available for each set of 30 words. Words were arranged randomly on each list 
and each subject was directed to select a different list during each of the four 
trials. Subjects weren't allowed to know which template was loaded and were 
not allowed to view the monitor during testing. 
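
The randomization described above can be sketched as a small planning routine. 
This is only an illustration under stated assumptions: in the experiment the 
subjects themselves selected a different word list for each trial, whereas the 
sketch draws the lists at random, and the function name and data layout are 
hypothetical. 

```python
import random

PATTERNS = ["natural", "artificial inflection", "rapid-speak",
            "interrogative", "declarative"]
LISTS_PER_WORD_SET = 6   # six shuffled lists were available per 30-word set
TRIALS = 4


def plan_subject_session(rng: random.Random):
    """Return a hypothetical plan: one word list and one template order per trial."""
    # A different word list for each of the four trials (no repeats).
    list_choices = rng.sample(range(1, LISTS_PER_WORD_SET + 1), TRIALS)
    plan = []
    for trial, word_list in enumerate(list_choices, start=1):
        order = PATTERNS[:]
        rng.shuffle(order)           # templates loaded in random order
        plan.append({"trial": trial, "word_list": word_list,
                     "template_order": order})
    return plan


plan = plan_subject_session(random.Random(0))
```

Seeding the generator makes a plan reproducible, which is convenient when the 
order of template loading has to be recorded alongside the results. 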

During each trial, statistics were recorded as to number of correct 
recognitions, misrecognitions and nonrecognitions (for the purposes of this thesis, 
misrecognitions and nonrecognitions were grouped together and counted as 
inaccurate recognitions by the SR system). 
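
The scoring rule above, with misrecognitions and nonrecognitions pooled as 
inaccurate recognitions, can be expressed as a short calculation. The function 
name and the example tallies are hypothetical and are not data from the 
experiment. 

```python
def recognition_accuracy(correct: int, misrecognized: int, unrecognized: int) -> float:
    """Percent accuracy with misrecognitions and nonrecognitions pooled as errors."""
    inaccurate = misrecognized + unrecognized
    total = correct + inaccurate
    if total == 0:
        raise ValueError("no utterances were scored")
    return 100.0 * correct / total


# Hypothetical tallies for one 90-word test run (not data from the experiment).
example = recognition_accuracy(correct=81, misrecognized=6, unrecognized=3)
print(example)  # 90.0
```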


E. INDEPENDENT AND DEPENDENT VARIABLES 
The independent variables were: pattern (one, two, three, four and five), trial 
(one through four), voice (one and two) and subjects (1-10). The dependent 
variable was accuracy. 


III. RESULTS 


A. OVERVIEW 

This section describes the results of the experiment. The analysis of variance 
and Duncan Range tests were performed using the arc sine transformation of 
relative difference scores to stabilize the variance of the error terms (Neter and 
Wasserman, 1974). The SR recognition accuracy figures that appear in charts, 
however, are expressed as percentages and are untransformed. 
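
The thesis does not give the exact form of the transformation that was applied. 
As an illustration only, the sketch below assumes the common variance-stabilizing 
form for a proportion p, namely 2 * arcsin(sqrt(p)); the function name is 
hypothetical. 

```python
import math


def arcsine_transform(proportion: float) -> float:
    """Variance-stabilizing transform 2*asin(sqrt(p)) for a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(proportion))
```

The transform stretches values near 0 and 1, where the variance of a raw 
proportion shrinks, so that the error variance is more nearly constant across 
the range of accuracies. 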

From a statistician’s viewpoint, the null hypothesis in this experiment was 
that all training methods for a dependent SR system would result in equivalent 
performance. 

1. Analysis of Variance 

Table I and Table II present, respectively, the three-way and four-way 
analysis of variance summary tables for recognition accuracy (arc sine 
transformation of raw data). F-ratios in Table I indicate that while the ‘pattern’ 
and ‘subject’ variables and their combination had significant effects on the 
results, ‘trials’ had no appreciable effect. The F-ratios in Table II again show that 
‘trials’ had no significant effect on the results while ‘pattern,’ ‘subject,’ ‘voice’ 
and their two-way interactions did. The three-way interaction of 
‘subject’-‘pattern’-‘voice’ was not significant. 


2. Impact of Variables 
a. ‘Subject’ Variable 
As expected, variability existed between subjects in regard to 
which patterns and type of voice performed better; however, their variance is 
isolated in this design. 
b. ‘Trial’ Variable 
The ‘trial’ variable had no significant effect in either phase of this 
study. Words were arranged randomly on each vocabulary list and this 
apparently eliminated any "learning" by the subjects. 


TABLE I 
ANALYSIS OF VARIANCE SUMMARY TABLE 
USING NATURAL VOICE INPUT AGAINST 
FIVE TYPES OF REFERENCE PATTERNS 

Source              df    SS         MS         F-ratio   Prob 

TABLE II 
ANALYSIS OF VARIANCE SUMMARY TABLE 
USING DECLARATIVE VOICE INPUT AGAINST 
TWO TYPES OF REFERENCE PATTERNS 

Source              df    SS         MS         F-ratio   Prob 
Pattern             1     3.5701     3.5701     1.99      [illegible] 
Subj                [illegible] 
Voice               1     20.3776    20.3776    11.34     0.023 
Trial.Subj          27    [illegible]  1.3019   [illegible] 
Subj.Pattn.Trial          50.8292    1.8826 
[illegible].Voice   3     [illegible]  0.9642 

c. ‘Pattern’ Variable 
The ‘pattern’ variable has a significant effect on performance, as 
depicted in Figures 1, 2 and 3. Figures 1 and 2 show the differences in pattern 
performance for each subject. Figure 3 shows the effect that the interaction of 
pattern and voice had on performance. To further isolate and analyze the 
‘pattern’ variable, Duncan’s Multiple-Range test was conducted. The results of 
the test are summarized in Tables III and IV. Note that there is no significant 
difference in percent accuracy between the natural and declarative patterns 
(Pattern #1 vs Pattern #5) when tested with a natural speech input (Table III). 
d. ‘Voice’ Variable 
The natural (Pattern #1) and declarative (Pattern #5) patterns were 
retested using a declarative voice. Figure 3 demonstrates that the interaction of 
input voice type and pattern type did significantly affect percent accuracy. Table 
IV shows the Duncan Range analysis of means for the two voice types. A 
declarative voice (Voice #2) takes advantage of all the positive qualities of 
spoken speech and seems to improve performance when used as a speech input 
even though there was no appreciable difference between the natural and 
declarative patterns using a natural input voice (Voice #1). 

(Patterns: 1 = natural, 2 = artificial inflection, 3 = rapid-speak, 
4 = interrogative, 5 = declarative) 

(Patterns: 1 = natural, 5 = declarative) 
(Voices: 1 = natural, 2 = declarative) 

Duncan's Multiple Range Test for Variable: ACCURACY 

TABLE IV 
Duncan's Range Test for Variable: ACCURACY 
Declarative and Natural Patterns 

B. DISCUSSION 

This experiment evaluated the overall SR accuracy of five training methods 
by using a natural speaking voice input into the VOTAN VTR 6050II system. 
Patterns one and five were not significantly different when compared to each 
other but were appreciably better than the other three patterns (Table III). This 
supports the Miller study which found that a natural voice pattern performed 
best. The recommendation in the SR system’s documentation was to train the 
system in a firm, natural voice. The declarative voice pattern was an attempt to 
interpret these recommendations. The natural and declarative patterns were 
consistently accurate for all subjects. Patterns two and three did not perform as 
well and were not as consistent. The rapid speech pattern in both studies was 
clearly not as robust as any of the other patterns. 

After determining that patterns one and five clearly resulted in more accurate 
recognitions, the subjects retested patterns one and five using a declarative voice 
input. As indicated by Figures 3 and 4, the declarative voice input significantly 
improved on the performance both patterns achieved with a natural voice input. 


Figure 4. Effect of Voice on Average Performance 
(Axes: Voice; % Accuracy) 
(Voices: 1 = natural, 2 = declarative) 

IV. CONCLUSIONS 


In summary, subjects, as expected, impacted performance, but their variance 
was isolated by this experiment’s design. The trial variable had no effect on this 
study. The effects of pattern, input voice and their interaction did significantly 
impact performance of the system. 

All patterns, with the exception of rapid speech, performed reasonably well. 
However, the natural and declarative templates clearly achieved the best 
recognition accuracy. Subjects tended to have difficulty producing the pattern 
two and four templates. Each subject had several utterances rejected because 
they weren't able to produce the correct inflection, utterances weren't loud 
enough, etc. Producing training templates must be an easy, straightforward and 
intuitive process if SR systems are to be readily accepted in the marketplace. 
Training in a natural voice is an obvious starting point and may produce 
acceptable results but, as demonstrated in both studies, there is a wealth of 
different methods that could be used. There is not an obvious or simple way to 
determine a SR system's optimal training method without conducting experiments 
similar to this one because each system's algorithms are different. 

This experiment demonstrated that recognition accuracy is also dependent on 
the type of voice used during system operation. Changing from a natural to a 
declarative voice during testing appreciably improved the system's performance. 
Declarative utterances are very intuitive to make and generate subtle differences 
in syllable stress, cadence, inflection and loudness. In this case, a declarative 
template combined with a natural voice input produced accuracies that were not 
significantly different from those produced by a natural template and a natural 
voice input. However, a declarative template combined with a declarative voice 
input was significantly better than any other pattern or combination that was tested. 

Does this mean that all systems should be trained and operated using a 
declarative voice? Not necessarily, because each system is different. Again, it is a 
reasonable method to start with and may produce acceptable or even optimal 
results depending on the SR system. Manufacturers of SR systems should test 
their systems using a variety of training methods and input voices to determine 
the best method for their specific system. They should then give concise and 
easily understood instructions on the best method to train and use their system. 
Vague or difficult to grasp directions do little to improve performance of the 
systems and can actually hinder it. The bottom line is customer satisfaction, and a 
little research and documentation up front can go a long way to improve the 
acceptance of speech recognition systems. 

The Naval Postgraduate School has many different state-of-the-art speech 
recognition systems and this writer would recommend that support from sponsors 
be provided to further resolve the questions posed in this thesis. The point of 
contact at NPS would be this writer's thesis advisor. 